When I tried to open the dataset with the plain read.csv function, I got the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate ‘row.names’ are not allowed
I opened the .csv with a normal text editor and counted the number of columns and values on each row. The number was correct in both cases, but there was something strange at the end of each row: a trailing comma. Because of it, R saw one more field in every data row than there were column names, so there was no one-to-one relationship between headers and values. In that situation R uses the first column as row names, and since its values repeat, it stops with the error above.
Using function parameters like row.names=NULL fixes the loading process, but it breaks the relationship between each column and its data, because the values end up shifted by one column. So I finally decided to use normal UNIX tools to fix the data set. This is what the original data looks like:
C00575795,"P00003392","Clinton, Hillary Rodham","KLEEMAN, ANNETTE","SANTA MONICA","CA","904021336","N/A","RETIRED",100,04-MAY-16,"","X","* HILLARY VICTORY FUND","SA18","1079219","C5503390","P2016",
And this is what we need:
C00575795,"P00003392","Clinton, Hillary Rodham","KLEEMAN, ANNETTE","SANTA MONICA","CA","904021336","N/A","RETIRED",100,04-MAY-16,"","X","* HILLARY VICTORY FUND","SA18","1079219","C5503390","P2016"
To fix the dataset I did the following:
$ cat P00000001-CA.csv | sed 's/.$//' > P00000001-CA-fixed.csv
P00000001-CA-fixed.csv is the file that I am going to use.
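For readers who prefer to stay inside R, the same clean-up can be sketched without sed. This is only a minimal sketch on a toy file, not the command used above: `strip_trailing_comma` is a hypothetical helper and the three-column file is made up. Unlike `s/.$//`, which removes the last character of every line whatever it is, the pattern `,$` removes the last character only when it is a comma, so a header line without a trailing comma is left untouched.

```r
# Hypothetical helper: drop one trailing comma from every line of a CSV file.
strip_trailing_comma <- function(infile, outfile) {
  lines <- readLines(infile, warn = FALSE)
  writeLines(sub(",$", "", lines), outfile)
  invisible(outfile)
}

# Toy file that mimics the problem: the header has 3 fields,
# every data row ends with a comma (a 4th, empty field).
raw_file   <- tempfile(fileext = ".csv")
fixed_file <- tempfile(fileext = ".csv")
writeLines(c("cand_nm,contbr_city,contb_receipt_amt",
             "Clinton,SANTA MONICA,100,",
             "Clinton,FRESNO,40,"), raw_file)

# Reading the raw file fails; capture the message instead of stopping.
raw_error <- tryCatch(read.csv(raw_file),
                      error = function(e) conditionMessage(e))

# After the clean-up the file loads with the expected 3 columns.
strip_trailing_comma(raw_file, fixed_file)
fixed <- read.csv(fixed_file)
```

On the real file the call would look like `strip_trailing_comma("P00000001-CA.csv", "P00000001-CA-fixed.csv")`.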
These two graphs show the number of contributions received, grouped by different factors. In the first graph we can see that Clinton and Sanders are the two candidates with the largest number of contributions received. Both are Democrats, so the second graph doesn’t really surprise us: Democrats are the ones that receive the most contributions in California. Also, taking into account that they won the last two elections in that state, it shows that the party is really strong there.
Here I divided the people into different groups:
We see some trends. For example, it seems that Clinton is the favourite of retired people, while in all other groups Sanders seems to be the winner in the number of contributions received. Clinton has already said that she is against raising the retirement age, but I don’t think that position sways retired people, since it doesn’t really affect them anymore. Still, it is interesting to see that retired people are the ones who don’t follow what seems to be the trend of contributing more often to Sanders.
Investigating the data by city, taking the 5 largest ones, we again see Democrats dominating the numbers, with Sanders again receiving the most contributions. At this point, it looks pretty clear that California is a Democratic state.
The dataset includes 18 variables with 653397 observations; the output below lists 19 because it also contains the party variable I created. It includes contributions made by people to different candidates. People are described by their name, the city where they live (and its zip code), the amount of money given, employer, occupation, receipt date and some other variables used to identify the contribution itself.
## [1] 653397 19
## 'data.frame': 653397 obs. of 19 variables:
## $ cmte_id : Factor w/ 24 levels "C00458844","C00500587",..: 6 7 6 7 7 6 6 7 7 6 ...
## $ cand_id : Factor w/ 24 levels "P00003392","P20002671",..: 1 12 1 12 12 1 1 12 12 1 ...
## $ cand_nm : chr "Clinton" "Sanders" "Clinton" "Sanders" ...
## $ contbr_nm : Factor w/ 111178 levels "& DREW BURKE, MELANIE",..: 52929 56715 33796 57448 57448 69571 4518 57462 57470 98984 ...
## $ contbr_city : Factor w/ 1488 levels "","*MORENO VALLEY",..: 1202 186 1399 1077 1077 226 1055 1412 1442 1269 ...
## $ contbr_st : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
## $ contbr_zip : Factor w/ 93931 levels "","00000","000090272",..: 12352 43904 43667 11331 11331 65865 24612 29931 40813 85363 ...
## $ contbr_employer : Factor w/ 37122 levels ""," APPLE INC.",..: 21803 2509 26638 35170 35170 21803 28897 23312 22578 30853 ...
## $ contbr_occupation: Factor w/ 16930 levels ""," REAL ESTATE BROKER",..: 12721 14127 12295 10681 10681 12721 7535 11666 9870 11600 ...
## $ contb_receipt_amt: num 100 40 80 35 100 ...
## $ contb_receipt_dt : Factor w/ 518 levels "01-APR-15","01-APR-16",..: 62 60 273 76 93 207 256 60 76 288 ...
## $ receipt_desc : Factor w/ 73 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ memo_cd : Factor w/ 2 levels "","X": 2 1 2 1 1 2 2 1 1 2 ...
## $ memo_text : Factor w/ 285 levels "","$1,500 REFUNDED ON 2/3/16.",..: 38 14 38 14 14 38 38 14 14 38 ...
## $ form_tp : Factor w/ 3 levels "SA17A","SA18",..: 2 1 2 1 1 2 2 1 1 2 ...
## $ file_num : int 1079219 1077404 1079219 1077404 1077404 1079219 1079219 1077404 1077404 1079219 ...
## $ tran_id : Factor w/ 650821 levels "A000771210424405B8CF",..: 161363 422120 163199 423564 425865 162306 166006 421582 423560 163924 ...
## $ election_t : Factor w/ 4 levels "","G2016","P2016",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ party : Factor w/ 4 levels "democrat","republican",..: 1 1 1 1 1 1 1 1 1 1 ...
The summary of the dataset shows that Sanders is the candidate that received the largest number of contributions, but we still don’t know if that means the largest amount of money. PENDERGAST, JAN is the contributor with the most contributions, 244 in total. Los Angeles is the city with the most contributions, and not employed people are the ones that contributed the most times. The median contribution is $27 and the mean is about $126. There are big outliers, for example negative contributions of -10000 (refunds) and positive ones of 10800, that we need to take into account.
## cmte_id cand_id cand_nm
## C00577130:371470 P60007168:371470 Length:653397
## C00575795:163216 P00003392:163216 Class :character
## C00574624: 57129 P60006111: 57129 Mode :character
## C00573519: 27342 P60005915: 27342
## C00458844: 14089 P60006723: 14089
## C00577312: 4696 P60007242: 4696
## (Other) : 15455 (Other) : 15455
## contbr_nm contbr_city contbr_st
## PENDERGAST, JAN : 244 LOS ANGELES : 48889 CA:653397
## MCLENNAN, MARLYN : 238 SAN FRANCISCO: 43991
## AUSLENDER, LEONARD: 230 SAN DIEGO : 23248
## WEIL, MONIQUE : 224 OAKLAND : 16394
## ISERI, MARTIN : 212 SAN JOSE : 15697
## SPEAR, JOSEPH : 210 BERKELEY : 12520
## (Other) :652039 (Other) :492658
## contbr_zip contbr_employer contbr_occupation
## 926372766: 280 NONE : 62886 NOT EMPLOYED: 99011
## 916055507: 277 RETIRED : 62531 RETIRED : 96464
## 950145153: 244 NOT EMPLOYED : 53310 TEACHER : 15016
## 946112036: 241 N/A : 42497 ATTORNEY : 14843
## 921034727: 238 SELF EMPLOYED: 35884 ENGINEER : 10008
## 900695947: 231 (Other) :395859 (Other) :417961
## (Other) :651886 NA's : 430 NA's : 94
## contb_receipt_amt contb_receipt_dt
## Min. :-10000.0 29-FEB-16: 11735
## 1st Qu.: 15.0 31-MAR-16: 11506
## Median : 27.0 31-MAY-16: 9916
## Mean : 126.1 30-APR-16: 9473
## 3rd Qu.: 75.0 09-MAR-16: 8887
## Max. : 10800.0 14-MAR-16: 8559
## (Other) :593321
## receipt_desc memo_cd
## :642970 :628525
## Refund : 4509 X: 24872
## REDESIGNATION FROM PRIMARY : 1323
## REDESIGNATION TO GENERAL : 1323
## REATTRIBUTION / REDESIGNATION REQUESTED: 569
## REATTRIBUTION TO SPOUSE : 529
## (Other) : 2174
## memo_text form_tp
## * EARMARKED CONTRIBUTION: SEE BELOW:353637 SA17A:629287
## :272415 SA18 : 19601
## * HILLARY VICTORY FUND : 19327 SB28A: 4509
## REDESIGNATION FROM PRIMARY : 1323
## REDESIGNATION TO GENERAL : 1323
## EARMARKED FROM MAKE DC LISTEN : 858
## (Other) : 4514
## file_num tran_id election_t
## Min. :1003942 A5602AD777C8C4632B5A: 4 : 138
## 1st Qu.:1066653 ADB49CB248C174E298F0: 4 G2016: 4958
## Median :1077404 A26C35A6066754130B99: 3 P2016:648294
## Mean :1070263 A340DF85B7F884133A20: 3 P2020: 7
## 3rd Qu.:1077665 A4E50E2DD07E4475996F: 3
## Max. :1079473 A7C22FA389E0348F98F0: 3
## (Other) :653377
## party
## democrat :535561
## republican :117562
## green : 197
## libertarian: 77
##
##
##
This dataset is all about money contributions made by people to different candidates. Usually, different groups of people have different favourite candidates depending on where they live and on their occupation (which affects yearly income). The main features here are the variables that describe people; the most interesting ones are the employment status, the city where they live and the amount of money contributed. Most of the other variables are not really useful (transaction IDs, forms, file IDs and so on), so we are going to concentrate on those I have already described.
Money helps a lot to win elections. It gives candidates the capacity to buy ads, spread propaganda and make their ideas reach a larger audience. It could be that the more money they receive, the higher their chances of winning. So the largest cities, and also people with more money, could decide who is going to be the winner. The question is, will the data show that trend?
In this first part I just counted the number of contributions, but that doesn’t show the full picture. Adding up the total amount of money contributed per group will start to give us more information, letting us see which group contributes more to the winning candidate and which variables seem to have interesting outliers.
I created different data frames to hold the data for the variables I find most interesting:
The idea is to be able to use the data for those particular variables without having to subset it every time I want to use it. That makes the code easier to write, read and maintain.
I also created the “party” variable to specify which party each candidate belongs to.
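As a rough illustration of those two steps, here is a minimal sketch on a toy data frame. The column names `cand_nm`, `contbr_occupation`, `contb_receipt_amt` and `party` and the data frame name `not_employed` appear in the output of this report; the lookup vector `party_of`, the name `retired` and the toy rows are assumptions made for the example, not the original code.

```r
# Toy stand-in for the real `ca` data frame (values are made up).
ca <- data.frame(
  cand_nm           = c("Clinton", "Sanders", "Cruz", "Stein", "Johnson"),
  contbr_occupation = c("RETIRED", "NOT EMPLOYED", "ATTORNEY",
                        "NOT EMPLOYED", "RETIRED"),
  contb_receipt_amt = c(100, 27, 250, 15, 50),
  stringsAsFactors  = FALSE
)

# Assumed lookup table: candidate surname -> party.
party_of <- c(Clinton = "democrat", Sanders = "democrat",
              Cruz = "republican", Stein = "green",
              Johnson = "libertarian")

# Add the party of each candidate as a new column.
ca$party <- unname(party_of[ca$cand_nm])

# One data frame per group of interest, so later code does not re-subset.
not_employed <- subset(ca, contbr_occupation == "NOT EMPLOYED")
retired      <- subset(ca, contbr_occupation == "RETIRED")
```

Keeping the lookup in a single named vector means a candidate missing from it shows up as `NA` in `party`, which is easy to spot with `table(ca$party, useNA = "ifany")`.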
Unusual distribution and interesting data
The two candidates receiving the largest number of contributions are Hillary Clinton and Bernard Sanders, both Democratic candidates.
If we keep investigating the number of contributions we also see that retired people prefer Hillary Clinton while the other groups go with Bernard Sanders.
Those without employment, which usually means without that much money to spend, are the ones that contributed the most times.
California is a democrat state.
Operations performed on the data
The first graph shows that most of the contributions are below $2500.
The next graphs tell a totally different story from the ones we saw in the previous section. Sanders was the one receiving the largest number of contributions, but if we count the total amount of money, Clinton is getting about twice the money that Sanders receives.
On the other hand, there are no surprises in the last graph. Democrats were the ones receiving the most contributions, and we see here that they are also the winners in the total amount of money received.
Retired people again contribute to Clinton ahead of every other candidate. Not employed people are the only ones that don’t follow this trend, most probably because Sanders proposed to:
expand Social Security, to the tune of $65 more per month on average, financed by raising payroll taxes on wealthy wage earnings. He’s also vowed to get the unemployed working again through a $1 trillion infrastructure plan the campaign says will create 13 million “good-paying” jobs.
Here we see even more interesting trends, which make our first analysis of the total number of contributions completely useless. In these 5 cities, Sanders received about twice as many contributions. But if we count the money, as we do in the first graph, we see that Clinton gets more than twice the money. She is the clear winner here. Democrats still get much more money (as they got many more contributions) than the other parties.
If we check the graph per city, we also see big differences similar to the one explained before. In Los Angeles, San Diego, San Francisco and San Jose, the graphs are nearly the opposite of what we saw when we just counted the number of contributions: Clinton is again the winner in amount (what really matters), but not in the number of contributions.
The last thing we see is that Fresno goes in the opposite direction from the other 4 biggest cities. Republicans win in the amount of money contributed, and Sanders is the one getting more money.
After adding a y variable to our graphs (the total amount of money contributed) we see that the picture changed a lot.
Now things are even clearer. Democrats come first, and among Democrats Clinton is in first place.
If we take the candidates into account, the picture varies a lot. It is clear that the number of contributions received doesn’t really correlate with the total amount of money. Sanders clearly gets the largest number of contributions, but Clinton gets the largest amount of money. Since both are Democrats, theirs is still the winning party in the state.
Not employed people prefer Sanders, probably because of his proposals. Also, Fresno seems to follow a different pattern from the other biggest cities and from the state of California as a whole. Fresno seems to be a Republican city.
Not employed people are, without doubt, supporting Sanders. It seems to be the only group where Sanders has done a really good job.
The new layer, which adds colour depending on the party that receives the contribution, reinforces the idea that CA is mostly Democratic. Republicans’ numbers are pretty low, so it is difficult to get information from them. Let’s remove Democrats so we can have an easier-to-read picture of all the others.
Cruz and Rubio are the favourites among Republicans.
The mean line added shows that there are many outliers here, making it a bit useless without some adjustments first.
Limiting the y axis to the range 0 to 1000 and adding an alpha of 1/100, we get a better picture. Clinton and Sanders have such a high number of contributions that they are well above the mean. Both candidates are outliers in themselves. We see that people tend to contribute $100, $250, $500, $750 and $1000. Those are round numbers, so they appear often.
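A minimal sketch of that adjustment, assuming the plots are drawn with ggplot2 (the report does not show its plotting code) and using made-up amounts. The sketch zooms with `coord_cartesian()`, which keeps every point in the data; whether the original plot zoomed or dropped the points outside the range is not stated, and the single overall mean line is likewise an assumption.

```r
# Made-up contributions; the real data frame is much larger.
set.seed(1)
toy <- data.frame(
  cand_nm           = rep(c("Clinton", "Sanders"), each = 500),
  contb_receipt_amt = sample(c(15, 27, 50, 100, 250, 500, 1000, 2700),
                             1000, replace = TRUE)
)

# Build the plot only when ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = cand_nm, y = contb_receipt_amt)) +
    # Heavy transparency: only stacked points become visible.
    geom_jitter(alpha = 1 / 100, height = 0) +
    # Overall mean of the amounts as a dashed horizontal line.
    geom_hline(yintercept = mean(toy$contb_receipt_amt),
               linetype = "dashed") +
    # Zoom the y axis to 0-1000 without discarding any rows.
    coord_cartesian(ylim = c(0, 1000))
}
```

With an alpha this low a single point is almost invisible, so the dark horizontal bands that remain are the amounts many people chose, which is how the round numbers stand out.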
There is not much new data, since we have already explored most of the useful variables of the data set. Adding parties as a third variable in our graphs helps us see the big difference between Democrats and Republicans in California, something we had already found but that becomes even clearer now.
Not really. The data we have seen was the expected one after previous analysis.
The data we are analysing includes the amount of money each candidate has received from individual contributors in California. Therefore, the first graph shown here gives a first overview of contributions in California as a whole. It shows the total amount of dollars received by each candidate.
In the graph, colours represent the party to which the candidate belongs. Clinton and Sanders, both Democrats, are the ones getting the most financial help. The difference between the two candidates is also pretty large, with Clinton receiving about twice as much money as Sanders.
Contributions received by each candidate
## ca$cand_nm: Bush
## [1] 3327044
## --------------------------------------------------------
## ca$cand_nm: Carson
## [1] 2952109
## --------------------------------------------------------
## ca$cand_nm: Christie
## [1] 456066
## --------------------------------------------------------
## ca$cand_nm: Clinton
## [1] 39364896
## --------------------------------------------------------
## ca$cand_nm: Cruz
## [1] 6283360
## --------------------------------------------------------
## ca$cand_nm: Fiorina
## [1] 1468489
## --------------------------------------------------------
## ca$cand_nm: Gilmore
## [1] 8100
## --------------------------------------------------------
## ca$cand_nm: Graham
## [1] 408595
## --------------------------------------------------------
## ca$cand_nm: Huckabee
## [1] 230890.6
## --------------------------------------------------------
## ca$cand_nm: Jindal
## [1] 23231.26
## --------------------------------------------------------
## ca$cand_nm: Johnson
## [1] 41187.6
## --------------------------------------------------------
## ca$cand_nm: Kasich
## [1] 1553824
## --------------------------------------------------------
## ca$cand_nm: Lessig
## [1] 186144.5
## --------------------------------------------------------
## ca$cand_nm: O'Malley
## [1] 297834.3
## --------------------------------------------------------
## ca$cand_nm: Pataki
## [1] 30450
## --------------------------------------------------------
## ca$cand_nm: Paul
## [1] 797624.3
## --------------------------------------------------------
## ca$cand_nm: Perry
## [1] 208400
## --------------------------------------------------------
## ca$cand_nm: Rubio
## [1] 4846484
## --------------------------------------------------------
## ca$cand_nm: Sanders
## [1] 18763935
## --------------------------------------------------------
## ca$cand_nm: Santorum
## [1] 36254.88
## --------------------------------------------------------
## ca$cand_nm: Stein
## [1] 27918
## --------------------------------------------------------
## ca$cand_nm: Trump
## [1] 501389.2
## --------------------------------------------------------
## ca$cand_nm: Walker
## [1] 495006.9
## --------------------------------------------------------
## ca$cand_nm: Webb
## [1] 76568.16
This graph also shows that the Democratic party is dominant in California, with both candidates acting as outliers in the graph. The rest of the candidates’ numbers are really low in comparison.
Number of contributions received by each party
##
## democrat republican green libertarian
## 535561 117562 197 77
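The two outputs above have the shape that base R’s `by()` and `table()` produce. A minimal sketch on a made-up data frame, assuming the column names shown earlier in the `str()` output; `tapply()` is added for the totals because it returns a plain named vector, which is easier to sort and reuse than the printed `by()` result.

```r
# Made-up rows standing in for the real `ca` data frame.
ca <- data.frame(
  cand_nm = c("Clinton", "Sanders", "Sanders", "Cruz", "Clinton"),
  party   = c("democrat", "democrat", "democrat", "republican", "democrat"),
  contb_receipt_amt = c(2700, 27, 15, 250, 100)
)

# Total amount per candidate, printed one block per candidate.
by(ca$contb_receipt_amt, ca$cand_nm, sum)

# Same totals as a named vector, largest first.
total_by_candidate <- sort(tapply(ca$contb_receipt_amt, ca$cand_nm, sum),
                           decreasing = TRUE)

# Number of contributions (rows) per party.
n_by_party <- table(ca$party)
```

Comparing `total_by_candidate` with `table(ca$cand_nm)` side by side is what separates "most money" from "most contributions", the distinction this report keeps coming back to.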
Even if the graph looks pretty straightforward, we cannot just extrapolate the data to every single combination and group of people, so in the next graph I am going to show some extreme cases that paint a completely different picture.
The data set includes information about the employment of each contributor. This is a very important piece of information, because people tend to vote based on their personal situation. So it is pretty usual to see different groups supporting different candidates depending on their wealth or job.
I have divided the contributors into different groups:
The idea is to find some group that doesn’t follow the general picture seen in the previous graph. In this second graph we can see that unemployed people are the only group that prefers Sanders over Clinton. The difference is also very large.
Contributions received by Sanders in total from everybody:
## subset(ca, cand_nm == "Sanders")$cand_nm: Sanders
## [1] 18763935
Contributions received by Sanders in total from unemployed:
## subset(not_employed, cand_nm == "Sanders")$cand_nm: Sanders
## [1] 4954404
About 26% of the money Sanders receives comes from this group. More information about why this could happen is in the reflection section.
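The share quoted above is just the ratio of the two totals printed before it. As a quick check of the arithmetic (the variable names are made up for the example):

```r
sanders_total        <- 18763935  # received from all contributors
sanders_not_employed <- 4954404   # received from not employed contributors

# Percentage of Sanders' money that comes from the not employed group.
share_pct <- 100 * sanders_not_employed / sanders_total
round(share_pct, 1)  # about 26.4
```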
Usually the biggest cities have the biggest impact, just because of their population. So I have analysed the data from the 5 biggest cities in California and, in the same way as in graph two, I tried to find something that goes against the ideas the first graph showed us.
In fact, we can see in this graph that Fresno is not Democratic. Republican candidates get most of the contributions, a totally different picture from the overall data of California. Actually, Fresno’s state and federal representation is mostly Republican, as we can see in the Wikipedia links presented in the next section.
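A minimal sketch of how such a per-city comparison could be computed, on made-up rows. The five city names are the ones discussed in this report, written in upper case as they appear in the summary output; the vector `top_cities`, the result name `money_by_city_party` and the amounts are assumptions for illustration.

```r
# Made-up rows standing in for the real `ca` data frame.
ca <- data.frame(
  contbr_city = c("LOS ANGELES", "LOS ANGELES", "FRESNO", "FRESNO",
                  "SAN DIEGO", "BERKELEY"),
  party       = c("democrat", "republican", "democrat", "republican",
                  "democrat", "democrat"),
  contb_receipt_amt = c(500, 100, 50, 300, 200, 75)
)

top_cities <- c("LOS ANGELES", "SAN DIEGO", "SAN JOSE",
                "SAN FRANCISCO", "FRESNO")

# Keep only the five cities, then total the money per city and party.
big <- subset(ca, contbr_city %in% top_cities)
money_by_city_party <- aggregate(contb_receipt_amt ~ contbr_city + party,
                                 data = big, FUN = sum)
```

The result has one row per city and party combination, so a city where the Republican total exceeds the Democratic one can be read directly from the table before plotting anything.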
I don’t live in the United States, so I started this project without background knowledge of its democratic system and without preconceived ideas. While analysing the data I learnt some very interesting information, summarized in the last three graphs from the previous section. I used Google to find news articles and Wikipedia information confirming that what I was discovering just by checking the graphs matched reality.
Clinton is the favourite one among democrats. https://en.wikipedia.org/wiki/Nationwide_opinion_polling_for_the_Democratic_Party_2016_presidential_primaries
Sanders is the favourite one among not employed people. https://heatst.com/politics/bernie-sanders-has-the-most-unemployed-donors/
CA is a Democrat state. https://en.wikipedia.org/wiki/California
Fresno goes in the opposite direction from the other big cities in the state. It is Republican. https://en.wikipedia.org/wiki/Fresno,_California
It is really amazing to see how you can discover real-life facts just by graphing numbers. The three selected graphs also tell us something really important: the full picture, the graph of the whole of California, can’t be used as a perfect representation of each city and group of people individually. There will be some that don’t follow the general rule, each for different reasons.
I encountered some problems while working with the data set. As I mentioned, most of the variables are of little use and cannot be correlated with the others: zip codes, candidate IDs, memo_cd, memo_text, file number and so on. So I had to investigate only a small subset of the variables. To get more information, I divided employment into groups and added the party of each candidate. But still, there were not many combinations. Actually, taking into account that the topic was political contributions, not much more information is needed apart from the money and some other metadata about the donors.
I was able to extract the data I was interested in and I learnt a lot about California, the different candidates and each political party.
For the future, it would be even better if there was data about really big donations from companies. That would help us see the usual relations between gas/oil/energy/technology companies and different candidates based on their ideas and future plans. Those big companies are the ones that really rule the world, and their contributions will change it in a more drastic way than individuals and their donations.